library(tidyverse)
library(SingleCellExperiment)Horizontal Data Integration
To start we will load the tidyverse packages and SingleCellExperiment:
Example data
For this tutorial, we will use a popular dataset by Kang et al. (2018). The dataset measured the effect of interferon-\(\beta\) stimulation on blood cells from eight patients. The muscData package provides an easy way to access the data as a SingleCellExperiment.
sce <- muscData::Kang18_8vs8()see ?muscData and browseVignettes('muscData') for documentation
loading from cache
sceclass: SingleCellExperiment
dim: 35635 29065
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
We log-transform the data to account for the heteroskedasticity, perform PCA to reduce the dimensions, and run UMAP for visualization. For the preprocessing, we will use scater package which adds a new assay called "logcounts" and two reducedDims(sce) called "PCA" and "UMAP" . Equivalent steps exist in Seurat or scanpy.
sce <- scater::logNormCounts(sce)
hvg <- order(MatrixGenerics::rowVars(logcounts(sce)), decreasing = TRUE)
sce <- sce[hvg[1:500], ]
sce <- scater::runPCA(sce, ncomponents = 50)
sce <- scater::runUMAP(sce, dimred = "PCA")To visualize the data, we use ggplot2
as_tibble(colData(sce)) |>
mutate(umap = reducedDim(sce, "UMAP")) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()The following code is based on Seurat’s Guided Clustering Tutorial.
# For more information about the conversion see `?as.Seurat.CellDataSet`
seur_obj <- Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj <- Seurat::NormalizeData(seur_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj <- Seurat::FindVariableFeatures(seur_obj, selection.method = "vst", nfeatures = 500)
# Subset to highly variable genes for memory efficiency
seur_obj <- seur_obj[Seurat::VariableFeatures(object = seur_obj),]
seur_obj <- Seurat::ScaleData(seur_obj)
seur_obj <- Seurat::RunPCA(seur_obj, verbose = FALSE)
seur_obj <-Seurat::RunUMAP(seur_obj, dims = 1:10)Seurat::DimPlot(seur_obj, reduction = "umap", group.by = "stim")Figure 1 shows that the data separates by the treatment status. For many downstream analyses, it would be good to know how the cells from the stimulated condition are related to the cells from the control condition. For example for cell type assignment, we might want to annotate both conditions together and want to discount the effect of the treatment. This process is called integration.
The goal is get a low-dimensional embedding of the cells where the stimulation does not affect the embedding and all residual variance comes from different cell states. Figure 2 shows a sucessfully integrated example.
Integration approaches
There are many methods for single-cell data integration and Luecken et al. (2022) benchmarked several approaches. Here, I will present four integration methods which are easy to use from R and cover a useful set of features:
- Manual projection
- Automated integration
- Harmony
- MNN
- Invertible integration
- LEMUR
Manual Projection
ctrl_mat <- logcounts(sce)[,sce$stim == "ctrl"]
stim_mat <- logcounts(sce)[,sce$stim == "stim"]
ctrl_centers <- rowMeans(ctrl_mat)
stim_centers <- rowMeans(stim_mat)
ctrl_pca <- irlba::prcomp_irlba(t(ctrl_mat - ctrl_centers), n = 20, center = FALSE)
ctrl_proj <- t(ctrl_pca$rotation) %*% (ctrl_mat - ctrl_centers)
stim_proj <- t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)
manual_proj <- matrix(NA, nrow = 20, ncol = ncol(sce))
manual_proj[,sce$stim == "ctrl"] <- as.matrix(ctrl_proj)
manual_proj[,sce$stim == "stim"] <- as.matrix(stim_proj)manual_proj_umap <- uwot::umap(t(manual_proj))
as_tibble(colData(sce)) |>
mutate(umap = manual_proj_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()"stim" condition?
The projection is orthogonal onto the subspace, which means it matters which condition is chosen as reference.
stim_pca <- irlba::prcomp_irlba(t(stim_mat - stim_centers), n = 20, center = FALSE)
ctrl_proj2 <- t(stim_pca$rotation) %*% (ctrl_mat - ctrl_centers)
stim_proj2 <- t(stim_pca$rotation) %*% (stim_mat - stim_centers)
manual_proj2 <- matrix(NA, nrow = 20, ncol = ncol(sce))
manual_proj2[,sce$stim == "ctrl"] <- as.matrix(ctrl_proj2)
manual_proj2[,sce$stim == "stim"] <- as.matrix(stim_proj2)
manual_proj_umap2 <- uwot::umap(t(manual_proj2))
as_tibble(colData(sce)) |>
mutate(umap = manual_proj_umap2) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()For this example, using the "stim" condition as the reference leads to a worse integration.
The projection approach consists of three steps:
- Centering the data (e.g.,
ctrl_mat - ctrl_centers). - Choosing a reference condition and calculating the subspace that approximates the data from the reference condition (
irlba::prcomp_irlba(t(stim_mat - stim_centers))$rotation). - Projecting the data from the other conditions onto that subspace (
t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)).
For arbitrary experimental designs, we can perform the centering with a linear model fit:
# A complex experimental design
lm_fit <- lm(t(logcounts(sce)) ~ condition + batch, data = colData(sce))
centered_mat <- t(residuals(lm_fit))
# Assuming that `is_reference_condition` contains a selection of the cells
ref_pca <- irlba::prcomp_irlba(centered_mat[,is_reference_condition], ...)
proj_mat <- t(ref_pca$rotation) %*% centered_matAutomatic integration
Harmony
One popular tool for data integration is Harmony (Korsunsky et al. 2019). Harmony is build around maximum diversity clustering Figure 3, which in addition to minimizing the distance of each data point to a cluster center also maximizes the mixing of conditions assigned to each cluster.
harm_mat <- harmony:::RunHarmony(reducedDim(sce, "PCA"), colData(sce),
vars_use = c("stim"))Transposing data matrix
Initializing state using k-means centroids initialization
Harmony 1/10
Harmony 2/10
Harmony 3/10
Harmony 4/10
Harmony converged after 4 iterations
harm_umap <- uwot::umap(harm_mat)
as_tibble(colData(sce)) |>
mutate(umap = harm_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()MNN
MNN is short for mutual nearest neighbors and was invented for integrating two conditions by identifying the cells which are mutually nearest neighbors Figure 4. The batchelor provides an efficient implementation which can also handle experimental designs with more than two conditions.
mnn_sce <- batchelor::fastMNN(sce, batch = sce$stim, BSPARAM=BiocSingular::IrlbaParam())
mnn_umap <- uwot::umap(reducedDim(mnn_sce, "corrected"))
as_tibble(colData(sce)) |>
mutate(umap = mnn_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()Invertible Integration
Tools like MNN and Harmony take a PCA embedding and remove the effects of the specified covariates. But there is no way to go back from the integrated embedding to the original gene space. This means that we cannot ask the counter factual what the expression of a cell from the control condition would have been, had it been treated.
A new tool called LEMUR provides this functionality by matching the subspace of each condition (Ahlmann-Eltze and Huber 2023). Figure 5 illustrates the principle.
LEMUR takes as input a SingleCellExperiment object, the specification of the experimental design, and the number of latent dimensions. To refine the embedding, we will use the provided cell type annotations.
fit <- lemur::lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit <- lemur::align_by_grouping(fit, fit$colData$cell, verbose = FALSE)fit <- lemur::lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit <- lemur::align_harmony(fit, verbose = FALSE)Making a UMAP plot of LEMUR’s embedding shows that it sucessfully integrated the conditions (Figure 6).
lemur_umap <- uwot::umap(t(fit$embedding))
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()The advantage of the invertible integration is that we can make predictions what the expression of a cell from the control condition would have been, had it been stimulated and vice versa. Contrasting those predictions tells us how much the gene expression changes for that cell in any gene.
We call LEMUR’s test_de function to compare the expression values in the "stim" and "ctrl" conditions.
fit <- lemur::test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))We can now pick individual genes and plot the predicted log fold change for each cell to show how it varies as a function of the underlying gene expression values ?@fig-lemur_plot_de.
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
mutate(expr = logcounts(fit)["PLSCR1",]) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = expr), size = 0.3) +
scale_color_viridis_c() +
facet_wrap(vars(stim)) +
coord_fixed()
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
mutate(de = assay(fit, "DE")["PLSCR1",]) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2() +
coord_fixed()nei <- lemur::find_de_neighborhoods(fit, group_by = vars(stim, ind), verbose = FALSE)
as_tibble(nei) %>%
arrange(pval)# A tibble: 500 × 13
name neighborhood n_cells sel_statistic pval adj_pval f_statistic df1
<chr> <I<list>> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 IL1RN <chr> 5156 426. 5.28e-19 2.64e-16 1612. 1
2 MT2A <chr> 27129 279. 1.71e-17 4.27e-15 1088. 1
3 NT5C3A <chr> 11862 354. 3.05e-17 5.08e-15 1019. 1
4 EIF2A… <chr> 28815 608. 6.69e-17 8.37e-15 932. 1
5 IFIT2 <chr> 8536 294. 1.45e-16 1.45e-14 853. 1
6 CXCL11 <chr> 7761 372. 6.99e-16 5.41e-14 713. 1
7 ISG20 <chr> 7988 741. 7.57e-16 5.41e-14 707. 1
8 IFI35 <chr> 8357 495. 1.07e-15 6.01e-14 680. 1
9 SSB <chr> 4677 451. 1.08e-15 6.01e-14 678. 1
10 IFI16 <chr> 28137 420. 1.20e-15 6.01e-14 670. 1
# ℹ 490 more rows
# ℹ 5 more variables: df2 <dbl>, lfc <dbl>, did_pval <dbl>, did_adj_pval <dbl>,
# did_lfc <dbl>
Session Info
sessionInfo()R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.7.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] muscData_1.14.0 ExperimentHub_2.8.0
[3] AnnotationHub_3.8.0 BiocFileCache_2.8.0
[5] dbplyr_2.3.2 SingleCellExperiment_1.22.0
[7] SummarizedExperiment_1.30.2 Biobase_2.60.0
[9] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0
[11] IRanges_2.34.0 S4Vectors_0.38.1
[13] BiocGenerics_0.46.0 MatrixGenerics_1.12.2
[15] matrixStats_1.0.0 lubridate_1.9.2
[17] forcats_1.0.0 stringr_1.5.0
[19] dplyr_1.1.2 purrr_1.0.1
[21] readr_2.1.4 tidyr_1.3.0
[23] tibble_3.2.1 ggplot2_3.4.4
[25] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] rstudioapi_0.14 jsonlite_1.8.8
[3] magrittr_2.0.3 ggbeeswarm_0.7.2
[5] farver_2.1.1 rmarkdown_2.22
[7] zlibbioc_1.46.0 vctrs_0.6.2
[9] memoise_2.0.1 DelayedMatrixStats_1.22.0
[11] RCurl_1.98-1.12 htmltools_0.5.5
[13] S4Arrays_1.0.4 curl_5.0.1
[15] BiocNeighbors_1.18.0 htmlwidgets_1.6.2
[17] cachem_1.0.8 ResidualMatrix_1.10.0
[19] igraph_1.4.3 mime_0.12
[21] lifecycle_1.0.3 pkgconfig_2.0.3
[23] rsvd_1.0.5 Matrix_1.6-4
[25] R6_2.5.1 fastmap_1.1.1
[27] GenomeInfoDbData_1.2.10 shiny_1.7.4
[29] digest_0.6.31 colorspace_2.1-0
[31] AnnotationDbi_1.62.1 scater_1.28.0
[33] irlba_2.3.5.1 RSQLite_2.3.1
[35] beachmat_2.16.0 filelock_1.0.2
[37] labeling_0.4.2 fansi_1.0.4
[39] timechange_0.2.0 httr_1.4.6
[41] compiler_4.3.0 bit64_4.0.5
[43] withr_2.5.0 BiocParallel_1.34.2
[45] viridis_0.6.3 DBI_1.1.3
[47] rappdirs_0.3.3 DelayedArray_0.26.3
[49] lemur_1.0.5 tools_4.3.0
[51] vipor_0.4.5 beeswarm_0.4.0
[53] interactiveDisplayBase_1.38.0 httpuv_1.6.11
[55] glmGamPoi_1.13.3 glue_1.6.2
[57] batchelor_1.16.0 promises_1.2.0.1
[59] grid_4.3.0 generics_0.1.3
[61] gtable_0.3.3 tzdb_0.4.0
[63] hms_1.1.3 BiocSingular_1.16.0
[65] ScaledMatrix_1.8.1 utf8_1.2.3
[67] XVector_0.40.0 RcppAnnoy_0.0.20
[69] ggrepel_0.9.3 BiocVersion_3.17.1
[71] pillar_1.9.0 later_1.3.1
[73] splines_4.3.0 lattice_0.21-8
[75] bit_4.0.5 tidyselect_1.2.0
[77] Biostrings_2.68.1 scuttle_1.10.1
[79] knitr_1.43 gridExtra_2.3
[81] RhpcBLASctl_0.23-42 xfun_0.39
[83] stringi_1.7.12 yaml_2.3.7
[85] evaluate_0.21 codetools_0.2-19
[87] BiocManager_1.30.20 cli_3.6.1
[89] uwot_0.1.14 xtable_1.8-4
[91] munsell_0.5.0 harmony_1.2.0
[93] Rcpp_1.0.10 png_0.1-8
[95] parallel_4.3.0 ellipsis_0.3.2
[97] blob_1.2.4 sparseMatrixStats_1.13.4
[99] bitops_1.0-7 viridisLite_0.4.2
[101] scales_1.2.1 crayon_1.5.2
[103] rlang_1.1.1 cowplot_1.1.1
[105] KEGGREST_1.40.0